# High-resolution visual understanding
Llava UHD V2 Vicuna 7B
LLaVA-UHD v2 is an advanced multimodal large language model built around a hierarchical window transformer, capable of capturing different visual granularities through a high-resolution feature pyramid.
Multimodal Fusion
Transformers

YipengZhang
103
6
CLIP Convnext Large D 320.laion2B S29b B131k Ft
MIT
A CLIP model based on the ConvNeXt-Large architecture and trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval.
Text-to-Image
TensorBoard

laion
3,810
3